Skip to content

Conversation

@dramaticlly
Copy link
Contributor

This patch add retainOrphanedDataFiles to the expire snapshots interface to allow retain the orphaned data files and also update both cleanup strategies. Most of time, we want to remove such data files as part of file clean up followed by snapshot expiration as those files are no longer referenced by any active snapshots.

However, when the underlying data files are shared/shallow copied to other table, or when parquet files being added by the add-files procedure https://iceberg.apache.org/docs/latest/spark-procedures/#add_files, we might want to keep the data files.

This option is disabled by default

Can you help take a look? @stevenzwu @amogh-jahagirdar

Update cleanup strategies to support retaining data files during snapshot expiration
@dramaticlly
Copy link
Contributor Author

FYI @huaxingao @singhpk234

* @param retain true to retain orphaned data files only reachable by expired snapshots
* @return this for method chaining
*/
default ExpireSnapshots retainOrphanedDataFiles(boolean retain) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this isn't necessary to achieve the behavior of "clean up the metadata files but keep the data/delete files". There's a deleteWith API which allows for custom functions to delete based on paths. A client could just pass in a function which based on if a path is a path in the metadata location clean it up, otherwise don't do anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or the passed in logic can be inverted, to skip cleanup if it's in the data location or some other heuristic but I think the point still stands

Copy link
Contributor Author

@dramaticlly dramaticlly Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @amogh-jahagirdar ! We actually explored that option and here's what we find

  1. use retainOrphanedDataFiles option actually speed up the clean up process by avoiding open and read the manifest files. If only metadata (like manifest-list and manifest) are considered for clean up, then we don't need to read the manifest entries, which is usually the bottleneck and requires work distribution. That's why most snapshot expiration is taken cared of in Spark action and procedures
  2. use DeleteWith consumer currently only provides a file path represented in String, we can use its file suffix to differentiate metadata and data files, but with introduction of Parquet Manifest - POC #13769, we can no longer rely on .parquet alone to tell. We can still probably rely on checking $tableLocation/data/ as part of file path but this is mostly conventional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dramaticlly , I mostly meant just using the table data location not a suffix as that wouldn't be sufficient but you're right that even the table data location is based on convention.
I agree though that specifying a custom cleanup function isn't the most ideal way to solve this use case.

I think the best argument for this kind of option is reducing costs (reducing unnecessary requests to read manifests, and the compute running when doing so) for cases where we know we only want to cleanup metadata and retain the data file.

The way I look at expressing this kind of logic is a bit different; rather than expressing which files we intend to retain, I look at it as which files should we cleanup. so something like a cleanExpiredFiles(CleanupMode mode)

and the options for CleanupMode like ALL, METADATA_ONLY, NONE. I think on this path we'd deprecate the cleanExpiredFiles(boolean) option as well, no point keeping both APIs?

If we were to go with the current approach we'd have to define precedence if someone uses the existing cleanExpiredFiles and the retainDataFiles, which is why it seems cleaner to me to define a cleanExpiredFiles API with some modes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amogh-jahagirdar , I think what you proposed makes sense, as today retainDataFiles need to have prerequisites of cleanExpiredFiles=true.

But since those are public API on interface, deprecation and removal will need to go through cycles. I can update to use CleanupMode behind the scene.

* @param retain true to retain orphaned data files only reachable by expired snapshots
* @return this for method chaining
*/
default ExpireSnapshots retainOrphanedDataFiles(boolean retain) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dramaticlly , I mostly meant just using the table data location not a suffix as that wouldn't be sufficient but you're right that even the table data location is based on convention.
I agree though that specifying a custom cleanup function isn't the most ideal way to solve this use case.

I think the best argument for this kind of option is reducing costs (reducing unnecessary requests to read manifests, and the compute running when doing so) for cases where we know we only want to cleanup metadata and retain the data file.

The way I look at expressing this kind of logic is a bit different; rather than expressing which files we intend to retain, I look at it as which files should we cleanup. so something like a cleanExpiredFiles(CleanupMode mode)

and the options for CleanupMode like ALL, METADATA_ONLY, NONE. I think on this path we'd deprecate the cleanExpiredFiles(boolean) option as well, no point keeping both APIs?

If we were to go with the current approach we'd have to define precedence if someone uses the existing cleanExpiredFiles and the retainDataFiles, which is why it seems cleaner to me to define a cleanExpiredFiles API with some modes?

deprecate cleanExpiredFiles method and update tests
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dramaticlly! the core logic looks right to me, just had some comments about if we could just undo any unrelated test refactorings in favor of another PR, and some other minor comments.

@amogh-jahagirdar amogh-jahagirdar changed the title [Core,api] Support retainOrphanedDataFiles in snapshot expiration Core, API: Support cleanupMode in snapshot expiration Oct 20, 2025
public ExpireSnapshots cleanExpiredFiles(boolean clean) {
this.cleanExpiredFiles = clean;
LOG.warn("cleanExpiredFiles(boolean) is deprecated. Use cleanMode(CleanupMode) instead.");
Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about throwing here, why not just override whatever the cleanup mode was set to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Eduard, I added such validation mainly want to prevent double intention where one set both the cleanExpired files as well as cleanupMode, so intention is not clear and override might be risky, this could lead to some unexpected results so throw early maybe helpful.

This can be found more in my unit test named testCannotSetCleanExpiredFilesAndCleanModeTogether in https://github.com/apache/iceberg/pull/14287/files?new_files_changed=true#diff-35ed4072da58b6d638da909476780a4da8d97390bd56e41f8555b15b91499b64R2071-R2091.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems ok to throw an exception here because we don't want to users to call both setters. only one should be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the condition may not be precise/perfect. e.g., if the client called cleanupLevel(ALL) (default value), it would still allow this method to go through.

the other way is to default cleanupLevel to null. but we would need to do a bit if-else check during read to apply the value.

maybe the complexity is not worth it. I am wondering if it is simpler to just go with what @nastra suggested. just rely on API deprecation to move users away from the old API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some nuance here and value to keep:

Take what I included in the unit test for an example here, use preconditions check here prevent the case where both cleanupLevel(ExpireSnapshots.CleanupLevel.METADATA_ONLY) and cleanExpiredFiles(false) is configured on snapshot expiration, as the intention is unclear at the moment, override to either NONE or METADATA_ONLY could potentially result in undesired results.

Although unlikely, for the corner case discussed here when client called cleanupLevel(ALL) and also set the cleanExpiredFiles

  • if cleanExpiredFiles = true, then cleanupLevel ends up resolve to ALL and logic is equivalent
  • if cleanExpiredFiles = false, we allow such override to happen and end up with cleanupLevel=None and retain all files, I think it's acceptable as we are moving from most restrictive and least restrictive, and those files can be later cleaned with orphan removal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current impleemntation is reasonable so long as there's no breakages when someone upgrades to 1.11 and using the deprecated API, which it looks like there's not since the condition is based on whether someone additionally set the new API. I also prefer the approach of failing if a non-default cleanup level is set and the old one is also set because it's pretty unlikely a user intended to do that and it forces them to resolve that ambiguity that @dramaticlly mentioned.

@Override
public ExpireSnapshots cleanMode(CleanupMode mode) {
Preconditions.checkNotNull(mode, "CleanupMode cannot be null");
Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as mentioned earlier about throwing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving this one, I do think this is a case where we should just be consistent with what's done in the other options in this API, which is just override what was previously set. e.g. we can set cleanExpiredFiles multiple times, and it'll just take the last. Any reason why this particular one should be different and throw @dramaticlly ?

Copy link
Contributor Author

@dramaticlly dramaticlly Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this mostly want to avoid ambiguous intention when user set both options together regardless of which option is chained first.

i.e both shall fail with exception to ask user for further clarification

table.expireSnapshots()
.cleanupLevel(ExpireSnapshots.CleanupLevel.METADATA_ONLY)
.cleanExpiredFiles(false)

and

table
.expireSnapshots()
.cleanExpiredFiles(false)
.cleanupLevel(ExpireSnapshots.CleanupLevel.METADATA_ONLY)

Updated preconditions to better assertion condition and message

public ExpireSnapshots cleanExpiredFiles(boolean clean) {
this.cleanExpiredFiles = clean;
LOG.warn("cleanExpiredFiles(boolean) is deprecated. Use cleanMode(CleanupMode) instead.");
Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems ok to throw an exception here because we don't want to users to call both setters. only one should be used.

Address feedback from Amogh, Eduard and Steven
@dramaticlly
Copy link
Contributor Author

@nastra @amogh-jahagirdar @stevenzwu ready for another look

public ExpireSnapshots cleanExpiredFiles(boolean clean) {
this.cleanExpiredFiles = clean;
LOG.warn("cleanExpiredFiles(boolean) is deprecated. Use cleanMode(CleanupMode) instead.");
Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current impleemntation is reasonable so long as there's no breakages when someone upgrades to 1.11 and using the deprecated API, which it looks like there's not since the condition is based on whether someone additionally set the new API. I also prefer the approach of failing if a non-default cleanup level is set and the old one is also set because it's pretty unlikely a user intended to do that and it forces them to resolve that ambiguity that @dramaticlly mentioned.

@Override
public ExpireSnapshots cleanMode(CleanupMode mode) {
Preconditions.checkNotNull(mode, "CleanupMode cannot be null");
Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving this one, I do think this is a case where we should just be consistent with what's done in the other options in this API, which is just override what was previously set. e.g. we can set cleanExpiredFiles multiple times, and it'll just take the last. Any reason why this particular one should be different and throw @dramaticlly ?

@dramaticlly
Copy link
Contributor Author

@amogh-jahagirdar @nastra can you take another look with the latest changes?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side, this looks good just minor stuff, thank you @dramaticlly for this improvement!

super(fileIO, deleteExecutorService, planExecutorService, deleteFunc);
}

/** {@inheritDoc} */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is this required? My understanding is that this is for subclasses so it inherits the parent docs and then adding additional context for the child class afterwards. I thought by default, if we just want the parent javadocs, we'll automatically inherit those? i.e. I thought only having inheritDoc by itself isn't required. It's useful when it's followed by some additional child class docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we dont really need this, I think it's the same for the compiler when override interface, some IDE like intellij will render this if one click the little button beside inherit doc to automatically show javadoc inline.

image

@stevenzwu stevenzwu merged commit 5cc1989 into apache:main Nov 21, 2025
45 checks passed
@stevenzwu
Copy link
Contributor

thanks @dramaticlly for the improvement, and @amogh-jahagirdar @nastra for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants